For this project we selected the data from the “VII Encuesta de Presupuestos Familiares” (VII Household Budget Survey). This survey is carried out every 5 years in Chile on a representative sample of homes, and it reveals the expenditure structure and consumption patterns of Chilean families and individuals. Its main use is the construction of a list of goods and services whose costs are monitored monthly to calculate the “IPC”, or consumer price index, which is the main index used to measure inflation in Chile. However, the survey also gathers an important amount of socioeconomic information about urban households and their inhabitants, including age, gender, housing tenure, incomes, education and working conditions.
The data used for this work is split into two data sets:
households contains 43 variables, where each observation (row) represents an inhabitant. Some of the variables relate to the individuals, like age, marital status, education, gender, incomes, etc.; other variables are aggregated over the household the inhabitants belong to (home.id variable), summarizing the household’s total income and expenses.
expenses has 10 variables, with one observation for each expenditure monitored by the survey. All of the variables are categorical, except for the numerical variable expense, which records the amount spent in each transaction. It also includes the home.id variable, which allows this data to be joined with the households in the first data set.
Some data wrangling is needed before starting the exploratory data analysis, because the households data set has one entry per household member, while the expenses data set records expenses per household only, not per inhabitant. This means that it is possible to merge the data by matching household ids, but not by individuals.
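The merge described above can be sketched in base R as follows. This is only an illustration on invented toy data: the column names home.id and expense come from the text, while age, total.expense, expenses.hh and merged are names assumed for the example, and the report itself may have used dplyr rather than these base functions.

```r
# Toy stand-ins for the two data sets; values are invented for illustration.
households <- data.frame(
  home.id = c(1, 1, 2, 3, 3, 3),
  age     = c(45, 12, 67, 38, 36, 5)
)
expenses <- data.frame(
  home.id = c(1, 1, 2, 3, 3),
  expense = c(120.5, 30.0, 80.0, 200.0, 15.5)
)

# Collapse the expenses to one row per household.
expenses.hh <- aggregate(expense ~ home.id, data = expenses, FUN = sum)
names(expenses.hh)[2] <- "total.expense"  # hypothetical column name

# Attach the household total to every inhabitant of that household.
merged <- merge(households, expenses.hh, by = "home.id", all.x = TRUE)
```

Because the join key is the household, every inhabitant of a household carries the same total, so household-level statistics should be computed on one row per household.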
The following libraries were used for this work:
ggplot2
gridExtra
GGally
ggthemes
dplyr
tidyr
knitr
The data is stored in RData files, after being transformed from SPSS data sets.
load("households.RData")
load("expenses.RData")
Some cleaning is still needed: for example, there are two negative ages, and some households have no total income reported because of missing data.
households <- subset(households, age >= 0 & !is.na(income.hh.av.rent))
Let’s start with some simple explorations of the population in our data set. From the variable descriptions we decided to focus on the following variables:
households$age
households$edu.level
households$kinship
households$edependence
households$head.exp
households$tph
households$income.hh.av.rent
households$num.inhabitants
What is the population’s age distribution? The summary function can give us a start:
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| 1 | 17 | 32 | 34.9 | 51 | 103 |
So, the median age of the population is 32 years, and the mean is 34.9 years. However, a plot will give us much more information about the age distribution of the population surveyed in our data set. The next figure shows a histogram of the age variable (a discrete numerical variable), with a bin width of 1 year. The plot shows that the age distribution is not normal and is positively skewed overall. This is expected, since the population must decrease with age as people die from accidents, illness or natural causes.
However, it is interesting to notice some peaks at around 5, 25 and 50 years of age: they might correspond to generations with higher birth rates or lower infant mortality.
Educational attainment measures the educational level attained by an individual. In our data set the variable that measures this is called edu.level, a categorical variable stored as a factor with 16 levels plus an “NA” for missing information. The next figure shows a bar plot with the distribution of this variable for our population. A logarithmic scale was used on the Y axis to display the differences better, as some categories have a small number of cases.
The variable kinship is a categorical variable stored as a factor with 14 levels, indicating the kinship of each individual with respect to the household head. We present a bar plot to analyse the distribution of this variable. Notice that the Y axis scale is again logarithmic, so we don’t lose information given the small number of cases for some relationships.
Most of the inhabitants of a household have a direct relation with the household head: they are either the children or the spouse in most cases.
Household tenure describes the kind of tenure a household has over its main dwelling place. tph is a categorical variable, again stored as a factor, with 9 levels plus an “NA” category for missing information.
The following figure shows a bar plot with the percentage each category represents over the total number of households surveyed. Over 60% of the households are either fully owned or owned through a mortgage still being paid. I was expecting a higher percentage of households paying rent; however, only around 15% of the households fall in this category.
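The percentages behind such a bar plot can be computed with table and prop.table. The sketch below uses invented data and assumes that tph is repeated on the row of every inhabitant, so it first keeps one row per household; the tenure labels, inhabitants and one.per.hh are hypothetical names, since the actual levels of tph are not listed here.

```r
# Invented example: one row per inhabitant, household tenure repeated per row.
inhabitants <- data.frame(
  home.id = c(1, 1, 2, 3, 3, 3, 4, 5),
  tph     = factor(c("owned", "owned", "rented", "mortgage",
                     "mortgage", "mortgage", "owned", "owned"))
)

# Keep one row per household so large households are not counted several times.
one.per.hh <- inhabitants[!duplicated(inhabitants$home.id), ]

# Percentage of households in each tenure category.
tenure.pct <- 100 * prop.table(table(one.per.hh$tph))
```

Skipping the de-duplication step would weight each category by the number of inhabitants rather than by the number of households.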
We analyse here the monthly household income. A household’s income adds up all the incomes of its inhabitants, including income from dependent work, independent work, rents, social benefits, financial instruments, pensions, etc. Our data set includes 10517 households. As shown in the following table, the summary of this variable tells us that the median income is US$1110 while the mean is US$1707, and 75% of the households earn less than US$1976.
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| 2.96 | 653.4 | 1110 | 1707 | 1976 | 53280 |
The next figure plots a histogram of the household income.
We can see that most of the data is under US$2000. For the next figure we change the X axis scale from linear to logarithmic, and we use geom_density instead of geom_histogram.
This plot gives a better description of the household income. We can see the peak at around US$1100, which is the median of the income distribution. However, this plot can be misleading for some readers, because it could lead them to believe that wealth is well distributed among Chilean households.
So, we plot a histogram once again with linear scales on both axes, but this time we remove the 10% of households with the highest incomes. The next figure shows the result.
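Removing the top 10% amounts to filtering on the 90th percentile. The following is a minimal base R sketch on simulated incomes, not the survey data; hh, p90 and hh.trimmed are names assumed for the example, and base hist stands in for the ggplot2 histogram used in the figures.

```r
set.seed(42)
# Simulated right-skewed monthly incomes in US$; not the survey data.
hh <- data.frame(income = rlnorm(1000, meanlog = 7, sdlog = 0.8))

# The 90th percentile is the lower limit of the top decile.
p90 <- quantile(hh$income, probs = 0.9)

# Keep only the households at or below that limit.
hh.trimmed <- subset(hh, income <= p90)

# Bin the trimmed incomes on a linear scale
# (drop plot = FALSE to actually draw the histogram).
h <- hist(hh.trimmed$income, breaks = 30, plot = FALSE)
```

The trimmed data keeps the shape of the bulk of the distribution while dropping the long right tail that compresses the linear-scale plot.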
The reader may wonder about the seemingly arbitrary ticks used on the X axis. These values are the limits of the income deciles for Chilean households. The following table illustrates these deciles.
| Percentile | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| US dollars | 3 | 420 | 576 | 739 | 908 | 1110 | 1369 | 1714 | 2310 | 3610 | 53280 |
These deciles were calculated using the quantile function, looking for the limits that separate the households into 10 groups, each one comprising 10% of the households. Using this new variable, which we called income.dec, we created the bar plot shown in the next figure. The plot shows how wealth is distributed among Chilean households: the top decile, the 10% of households with the highest incomes, receives around 35% of the total wealth.
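As a rough sketch of how such a decile variable can be built, the example below applies quantile and cut to simulated incomes. Only quantile and the name income.dec come from the text; the use of cut and tapply, and the names limits and share, are assumptions made for illustration.

```r
set.seed(7)
# Simulated right-skewed incomes; not the survey data.
income <- rlnorm(2000, meanlog = 7, sdlog = 0.8)

# Eleven cut points: the minimum, nine inner decile limits and the maximum.
limits <- quantile(income, probs = seq(0, 1, by = 0.1))

# Assign each household to a decile; include.lowest keeps the minimum in decile 1.
income.dec <- cut(income, breaks = limits, labels = 1:10, include.lowest = TRUE)

# Share of the total income received by each decile, in percent.
share <- 100 * tapply(income, income.dec, sum) / sum(income)
```

Each decile holds the same number of households, so any difference between the entries of share reflects how unevenly the income is spread.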
The variable num.inhabitants holds the number of inhabitants per household. The following figure shows a histogram with the distribution of the number of inhabitants per household.
From the following summary table we can conclude that the median household size is 3 inhabitants, while the mean is 3.389.
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| 1 | 2 | 3 | 3.389 | 4 | 15 |
In this section we study the variable edependence. This variable holds the kind of educational institution the students in our population are attending. It is categorical, stored as a factor with 12 levels, including 1 level for the individuals not studying. In the next figure we show the distribution ordered by the number of cases per institution, removing the population that is not currently studying or attending any institution.
Regarding primary and secondary education, the next figure shows a bar plot with the percentage of students attending a public school, a private school or a state-subsidized institution.
The variable health.exp records how much the surveyed population spends monthly on health, mainly on health insurance, either social or private, which is mandatory for dependent and independent workers.
| Statistic | Value |
|---|---|
| Min. | 2.402 |
| 1st Qu. | 27.24 |
| Median | 44.36 |
| Mean | 80.98 |
| 3rd Qu. | 91.5 |
| Max. | 2279 |
The previous table shows that the median expenditure is US$44.36, with a mean of US$80.98 and a maximum of US$2279. How is the expenditure distributed? The following figure shows a histogram with this information.
As with other variables in previous sections, there are some outliers that prevent us from seeing the distribution in greater detail. So we create a new plot with a logarithmic X axis scale.
All of the variables explored in this section come from the households data set. Some of the variables are per household and others per individual. The most interesting features were the ones analysed: population age, educational attainment, kinship, household tenure, household income, educational institutions and health expenditure.
From the exploration, we found that the Chilean population is young overall, with 50% of the population under 32 years old. From the number of inhabitants per household and the relationship of the dwellers with the household head, we can also conclude that households are mainly composed of families with parents and children living together.
One of the most important conclusions is that the wealth distribution in Chile seems to present high levels of inequality, with 50% of the households earning less than US$1110 monthly, and the highest decile earning more than US$3610 and concentrating around 35% of the wealth.
We can rearrange the plot built in Household’s inhabitants Age Distribution and create a population pyramid:
The peaks seem to change for each gender! We can also notice that there are more women (53.2%) than men (46.8%). Are the average ages of the two genders different?
We see a difference in the average age of the two genders, with males being younger overall.
| | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| Men | 1 | 16 | 30 | 33.49 | 50 | 101 |
| Women | 1 | 18 | 34 | 36.15 | 52 | 103 |
Let’s test whether the difference is statistically significant by using the Wilcoxon rank-sum test:
| Test statistic | P value | Alternative hypothesis |
|---|---|---|
| 143282477 | 2.947e-28 * * * | two.sided |
The test confirms that the difference in age between genders is significant, with p < 0.05.
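A test of this kind can be run with wilcox.test from base R. The sketch below uses simulated ages for two groups, so its numbers are unrelated to the table above; the data frame ages and its columns are hypothetical, and the exact call used in the report is not shown in the text.

```r
set.seed(123)
# Simulated ages for two groups; invented numbers, not the survey data.
ages <- data.frame(
  age    = c(round(runif(300, 1, 90)), round(runif(300, 10, 100))),
  gender = rep(c("Men", "Women"), each = 300)
)

# Two-sided Wilcoxon rank-sum (Mann-Whitney) test of age by gender.
test <- wilcox.test(age ~ gender, data = ages, alternative = "two.sided")

# The p-value is compared against the chosen significance level (0.05 here).
significant <- test$p.value < 0.05
```

The rank-sum test is a reasonable choice here because the age distribution is skewed rather than normal, which makes a t-test less appropriate.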
We now study the relation between educational attainment and age. The next figure explores this relation for two groups taken from the population that is currently not studying (as reported by the variable studying): one group includes the whole population not studying (top bar plot); the other includes only the population not studying and over 30 years old. The variables used are:
studying: indicates whether the person is currently studying.
age: individual age, in years.
edu.level: education attainment. Factor with 16 levels, plus a NA option.
The black vertical line separates the levels related to primary and secondary education (to the left) from tertiary or higher education (to the right). Not much difference is seen between the two plots, but we wanted to be sure that we were not including individuals who have not actually finished their education, even when they are reported as not studying at the moment of the survey.
We will study the relation between educational attainment and gender, using a stacked histogram.
Because the number of women in our sample is bigger than the number of men, we will instead use the percentage of each gender for each level.
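One way to obtain such percentages is to normalise the counts within each gender, so that the two groups become comparable despite their different sizes. The sketch below uses a tiny invented sample; the level labels and the names people, counts and pct are assumptions, and normalising within gender is only one reading of the sentence above.

```r
# Invented sample with more women than men.
people <- data.frame(
  gender    = c("Women", "Women", "Women", "Women", "Men", "Men"),
  edu.level = c("Primary", "Secondary", "Secondary", "Tertiary",
                "Secondary", "Tertiary")
)

# Cross-tabulate attainment (rows) against gender (columns).
counts <- table(people$edu.level, people$gender)

# margin = 2 normalises each gender column so that it sums to 100%.
pct <- 100 * prop.table(counts, margin = 2)
```

With raw counts, Secondary would look twice as common among women as among men; with column percentages both genders show 50% at that level.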
We don’t see major differences in educational attainment for different genders.
What are households spending on?
| D Code | Description |
|---|---|
| 01 | Food and non-alcoholic beverages |
| 02 | Alcoholic beverages, tobacco and narcotics |
| 03 | Clothing and footwear |
| 04 | Housing, water, electricity, gas and other fuels |
| 05 | Furnishings, household equipment and routine household maintenance |
| 06 | Health |
| 07 | Transport |
| 08 | Communication |
| 09 | Recreation and culture |
| 10 | Education |
| 11 | Restaurants and hotels |
| 12 | Miscellaneous goods and services |
So, while the mean household income is US$1707 the median is at US$1110.